Abstract In this work we present a review of the state of the art of Learning
Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates
the most relevant LVQ approaches to date. The main concepts associated with
modern LVQ approaches are defined. A comparison is made among eleven LVQ
classifiers using one real-world and two artificial datasets.

1 Introduction

Learning Vector Quantization (LVQ) is a family of algorithms for statistical
pattern classification that aims at learning prototypes (codebook vectors)
representing class regions. The class regions are defined by hyperplanes
between prototypes, yielding Voronoi partitions. In the late 1980s Teuvo
Kohonen introduced the LVQ1 algorithm [36], [38], and over the years produced
several variants. Since their inception, LVQ algorithms have been researched by
a small but active community. A search on the ISI Web of Science in November
2013 found 665 journal articles with the keywords “Learning Vector Quantization”
or “LVQ” in their titles or abstracts. This paper is a review of the progress
made in the field during the last 25 years.

LVQ algorithms are related to other competitive learning algorithms such
as self-organizing maps (SOMs) [38] and c-means. Competitive learning
algorithms are based on the winner-take-all learning rule, and on variants in
which only certain elements or neighborhoods are updated during learning. The
original LVQ algorithms and most modern extensions use supervised learning
for obtaining class-labeled prototypes (classifiers). However, LVQ can also
be trained without labels by unsupervised learning for clustering purposes 
0 ttZl 1311 [Ml IMl I13- In this paper we will focus our review only on LVQ 
classifiers. 

LVQ classifiers are particularly intuitive and simple to understand because 
they are based on the notion of class representatives (prototypes) and class 
regions usually in the input space (Voronoi partitions). This is an advantage 
over multilayer perceptrons or support vector machines (SVMs), which are 
considered to be black boxes. Moreover, support vectors are extreme values 
(those having minimum margins) of the datasets, while LVQ prototypes are 
typical vectors. Another advantage of LVQ algorithms is that they are simple 
and fast, as a result of being based on Hebbian learning. The computational 
cost of LVQ algorithms depends on the number of prototypes, which is usually
fixed. The cost of SVMs depends instead on the number of training samples,
because the number of support vectors is a fraction of the size of the training 
set. LVQ has been shown to be a valuable alternative to SVMs nanz]. 

LVQ classifiers try to approximate the theoretical Bayesian border, and can 
deal directly with multi-class problems. The initial LVQ learning rules were 
heuristic, and showed sensitivity to initialization, slow convergence problems 
and instabilities. However, two main approaches have been proposed that define
explicit cost functions from which learning rules are derived via steepest
descent or ascent [56], [63], [64], solving the convergence problem of the
original LVQ algorithms. The first model is a generalization of LVQ called
Generalized Learning Vector Quantization (GLVQ) [56]. In GLVQ a cost function
is defined in such a way that a learning rule is derived via steepest descent.
This cost function has been shown to be related to a minimization of errors, and
a maximization of the margin of the classifier [7]. The second approach is
called Robust Soft-LVQ (RSLVQ) [63], [64], in which a statistical objective
function is used to derive a learning rule by gradient ascent. The probability 
density function (pdf) of the data is assumed to be a Gaussian mixture for 
each class. Given a data point, the logarithm of the ratio of the pdf of the 
correct class versus the pdf’s of the incorrect classes serves as a cost function 
to be maximized. 

Other improvements to LVQ address the initialization sensitivity of the
original LVQ algorithms and GLVQ [niEiiissiiii. Recent extensions of the
LVQ family of algorithms substitute the Euclidean distance with more general
metric structures such as weighted Euclidean metrics [18], adaptive relevance
matrix metrics m, pseudo-Euclidean metrics [22], and similarity measures in
kernel feature space that lead to kernelized versions of LVQ [52].

There are thousands of LVQ applications, for example in image and signal
processing ISlIHllsgilollllllMl ESI, the biomedical field and medicine Emu
[131 HU HE HE |SE |S7], and industry [HIUIE ISE SIl [75], to name just a few.
An extensive bibliography database is available in [JS].

Fig. 1 Taxonomy of the most relevant Learning Vector Quantization classifiers since the
seminal work of Teuvo Kohonen in the late 80s.

In this paper we present a comprehensive review of the most relevant
supervised LVQ algorithms developed since the original work of Teuvo Kohonen.
We introduce a taxonomy of LVQ classifiers, and describe the main algorithms.
We compare the performance of eleven LVQ algorithms empirically on artificial
and real-world datasets. We discuss the advantages and limitations of
each method depending on the nature of the datasets.

The remainder of this paper is organized as follows. Section 2 presents a
taxonomy of LVQ classifiers. Section 3 describes the main LVQ learning rules.
Section 4 shows the results obtained with eleven different LVQ methods
on three different datasets. Section 5 presents some open problems.
Finally, Section 6 draws conclusions.


2 A Taxonomy of LVQ Classifiers 

2.1 Learning Vector Quantization 

Let $X = \{(x_i, y_i) \in \mathbb{R}^D \times \{1, \dots, C\} \mid i = 1, \dots, N\}$
be a training data set, where $x = (x_1, \dots, x_D) \in \mathbb{R}^D$ are
$D$-dimensional input samples, with cardinality $|X| = N$;
$y_i \in \{1, \dots, C\}$, $i = 1, \dots, N$, are the sample labels, and $C$ is
the number of classes. The neural network consists of a number of prototypes,
which are characterized by vectors $w_i \in \mathbb{R}^D$, for $i = 1, \dots, M$,
and their class labels $c(w_i) \in \{1, \dots, C\}$ with
$Y = \{c(w_j) \in \{1, \dots, C\} \mid j = 1, \dots, M\}$. The classification
scheme is based on the best matching unit (BMU) (winner-takes-all strategy).
The receptive field of prototype $w_i$ is defined as follows:

$$ R^i = \{x \in X \mid \forall w_j \, (j \neq i) \rightarrow d(w_i, x) < d(w_j, x)\}, \tag{1} $$

where $d(w, x)$ is a distance measure. Learning aims at determining the weight
vectors (prototypes), so that the training data samples are mapped to their
corresponding class labels.
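
The winner-takes-all scheme of Eq. (1) can be illustrated with a minimal Python sketch. The function names, the toy codebook and the choice of the squared Euclidean distance are assumptions made only for this illustration; ties are resolved in favor of the lowest prototype index, which the text does not specify.

```python
def squared_euclidean(x, w):
    """Squared Euclidean distance between a sample x and a prototype w."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def best_matching_unit(x, prototypes, dist=squared_euclidean):
    """Index of the prototype closest to x, i.e. whose receptive field holds x."""
    return min(range(len(prototypes)), key=lambda j: dist(x, prototypes[j]))


def classify(x, prototypes, labels, dist=squared_euclidean):
    """Winner-takes-all classification: return the class label of the BMU."""
    return labels[best_matching_unit(x, prototypes, dist)]


# Hypothetical toy codebook: three prototypes, two classes.
prototypes = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
labels = [1, 2, 1]
print(classify((0.9, 0.8), prototypes, labels))
```

Passing a different `dist` function changes the shape of the receptive fields without touching the classification rule.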

Fig. 1 shows a taxonomy of the most relevant LVQ classifiers developed
since the pioneering work of Kohonen. LVQ methods are decomposed into three
families: Kohonen’s LVQ methods (middle branch: heuristic), methods based on
margin maximization (left branch), and methods based on likelihood ratio
maximization (right branch).

The original LVQ algorithm does not have an associated cost function to 
ensure convergence. Some improvements such as LVQ2.1, LVQ3 or OLVQ m 
aim at achieving higher convergence speed or better approximation of the 
Bayesian borders. All these LVQ versions are based on Hebbian learning, i.e., 
are heuristic. 

The original LVQ1 [36], [38] corrected only the winner prototype. This
algorithm pulled the prototypes away from the class borders. LVQ1 assumes a
good initial state of the network, i.e., it requires a preprocessing method. It
is also sensitive to overlapping data sets, and some of its neurons never
learn the training patterns. The LVQ2 algorithm updates two vectors at each
step, the winner and the runner-up. The purpose is to estimate differentially
the decision border towards the theoretical Bayes decision border. However, this
algorithm makes corrections that depend on the error only, and it presents
some instabilities. LVQ3 corrects the convergence problem of LVQ2, in which the
location of the prototypes keeps changing under continued learning, by adding a
stability factor.


2.2 Margin Maximization 

In [12] a margin analysis of the original LVQ algorithm was performed. There
are two definitions of margin. The first one is the sample margin, which
quantifies how far a sample can travel through space without changing the
classification rate of the classifier. This is the definition used
by SVMs. The second definition is the hypothesis margin, which quantifies
how much the classifier (e.g. a hyperplane) can be altered without changing the
classification rate. This is the definition used by AdaBoost.

In the context of LVQ, the sample margin is hard to compute and numerically
unstable m-. This is because small repositionings of the prototypes
might create large changes in the sample margin. Crammer et al. [12] showed
that the decision borders of the original LVQ algorithms are hypothesis margin
maximizers. The margin is associated with generalization error bounds.
Therefore, maximizing the margin amounts to minimizing a bound on the
generalization error. Interestingly, the bound is dimension free, but depends
on the number of prototypes.

The left branch of our taxonomy shown in Fig. 1 corresponds to the LVQ
methods based on a margin maximization approach. GLVQ [56] proposed
a cost function that aims at margin maximization. This approach solves some
limitations of the original LVQ algorithms, such as slow convergence,
initialization sensitivity, and limitations in multidimensional data where
correlations exist between dimensions, to name just a few.

The GLVQ cost function is defined as follows:

$$ E_{GLVQ} = \sum_{i=1}^{N} \phi(\mu(x_i, W)), \tag{2} $$

where $\phi(\cdot)$ is the logistic sigmoid function, and $\mu$ is the relative
distance difference

$$ \mu(x_i, W) = \frac{d^+ - d^-}{d^+ + d^-}, \tag{3} $$

where $d^+ = d(x_i, w^+)$ is the Euclidean distance of data point $x_i$ from its
closest prototype $w^+$ having the same class label, and $d^- = d(x_i, w^-)$ is
the Euclidean distance from the closest prototype $w^-$ having a different class
label. The term $d^+(x_i) - d^-(x_i)$ constitutes the hypothesis margin of an LVQ
classifier according to the winner-takes-all rule [12], [20]. Note that since GLVQ
includes this margin in its cost function, it can be described as a margin
optimization learning algorithm. Generalization bounds show that the larger
the margin, the better the generalization ability [^. The cost function in Eq.
(2) has been extended to other distance metrics. In |M| a generalization bound
is derived for Generalized Relevance LVQ (GRLVQ), which uses an adaptive
metric.
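
As an illustration of Eqs. (2) and (3), the following sketch evaluates the relative distance difference and the GLVQ cost for a toy codebook. It is a sketch under stated assumptions: the function names and data are hypothetical, and the squared Euclidean distance is used as the default `dist` (any other distance function can be passed instead). The degenerate case $d^+ + d^- = 0$ is not handled.

```python
import math


def sq_dist(x, w):
    """Squared Euclidean distance (assumed default distance for this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def relative_distance_difference(x, y, prototypes, labels, dist=sq_dist):
    """mu(x, W) of Eq. (3); it lies in [-1, 1] and is negative when the
    closest prototype of the correct class is nearer than any wrong one."""
    d_plus = min(dist(x, w) for w, c in zip(prototypes, labels) if c == y)
    d_minus = min(dist(x, w) for w, c in zip(prototypes, labels) if c != y)
    return (d_plus - d_minus) / (d_plus + d_minus)


def sigmoid(z):
    """Logistic sigmoid phi."""
    return 1.0 / (1.0 + math.exp(-z))


def glvq_cost(samples, targets, prototypes, labels):
    """E_GLVQ of Eq. (2): sum of phi(mu) over all training samples."""
    return sum(sigmoid(relative_distance_difference(x, y, prototypes, labels))
               for x, y in zip(samples, targets))


# Hypothetical toy data: one prototype per class, one sample of class 1.
prototypes = [(1.0, 0.0), (0.0, 2.0)]
labels = [1, 2]
mu = relative_distance_difference((0.0, 0.0), 1, prototypes, labels)
cost = glvq_cost([(0.0, 0.0)], [1], prototypes, labels)
```

Here the sample is closer to its own class prototype, so `mu` is negative and its contribution to the cost is below 0.5.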


2.3 Likelihood Ratio Maximization 

In this section we describe in detail the cost function proposed by Robust
Soft-Learning Vector Quantization (RSLVQ) [63], [64]. This cost function gives
rise to the right branch of the taxonomy illustrated in Fig. 1, which is based
on a Gaussian probabilistic model of the data in the form of mixture models. It
is assumed that the probability density function (pdf) of the data is described
by a Gaussian mixture model for each class. The pdf of a sample generated by
the Gaussian mixture model of the correct class is compared to the pdf of this
sample generated by the Gaussian mixture models of the incorrect classes. The
logarithm of the ratio between the correct Gaussian mixture model and the
incorrect Gaussian mixture models of probability densities is maximized. Let
$W$ be a set of labeled prototype vectors. The probability density of the data
is given by

$$ p(x \mid W) = \sum_{y=1}^{C} \sum_{\{j : c(w_j) = y\}} p(x \mid j) P(j), \tag{4} $$

where $C$ is the number of classes and $y$ is the class label of the data points
generated by component $j$. Also, $P(j)$ is the probability that data points are
generated by component $j$ of the mixture, and it can be chosen identically
for each prototype $w_j$. $p(x \mid j)$ is the conditional pdf that component $j$
generates a particular data point $x$, and it is a function of prototype $w_j$.
The following likelihood ratio is proposed as a cost function to be maximized:

$$ E_{RSLVQ} = \sum_{i=1}^{N} \log \left( \frac{p(x_i, y_i \mid W)}{p(x_i \mid W)} \right), \tag{5} $$

where $p(x, y \mid W)$ is the pdf of a data point $x$ that is generated by the
mixture model for the correct class $y$, and $p(x \mid W)$ is the total
probability density of the data point $x$. These probabilities are defined as
follows:

$$ p(x_i, y_i \mid W) = \sum_{\{j : c(w_j) = y\}} p(x_i \mid j) P(j), \tag{6} $$

$$ p(x_i \mid W) = \sum_{j} p(x_i \mid j) P(j). \tag{7} $$

The conditional pdfs are assumed to be of the normalized exponential form
$p(x \mid j) = K(j) \cdot \exp f(x, w_j, \sigma_j^2)$, with

$$ f(x, w, \sigma^2) = -\frac{d(x, w)}{2\sigma^2}, \tag{8} $$

where $d(x, w)$ is the Euclidean distance measure. Note that Eq. (8) provides a
way to extend LVQ to other distance metrics by changing the distance measure
$d$.

Fig. 2 Shape of the receptive field of prototypes depending on the distance measure. (a)
Receptive field using the Euclidean distance measure. (b) Receptive field using a diagonal
matrix of relevance. (c) Receptive field using a full matrix of relevance.
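
The per-sample term of the likelihood ratio in Eq. (5) can be sketched as follows. This is a minimal illustration, assuming equal priors $P(j)$, a common variance $\sigma^2$ and hence equal normalization constants $K(j)$, so that these factors cancel in the ratio; the function names, the data and the use of the squared Euclidean distance are hypothetical choices for this example.

```python
import math


def f(x, w, sigma2):
    """Exponent of Eq. (8), here with the squared Euclidean distance."""
    d = sum((a - b) ** 2 for a, b in zip(x, w))
    return -d / (2.0 * sigma2)


def log_likelihood_ratio(x, y, prototypes, labels, sigma2=1.0):
    """log( p(x, y | W) / p(x | W) ) for a single sample, Eqs. (5)-(7).

    Assumes equal priors P(j) and equal K(j), which cancel in the ratio.
    """
    weights = [math.exp(f(x, w, sigma2)) for w in prototypes]
    correct = sum(g for g, c in zip(weights, labels) if c == y)
    total = sum(weights)
    return math.log(correct / total)


# Hypothetical toy data: the sample sits on the class-1 prototype.
prototypes = [(0.0, 0.0), (3.0, 0.0)]
labels = [1, 2]
r_correct = log_likelihood_ratio((0.0, 0.0), 1, prototypes, labels)
r_wrong = log_likelihood_ratio((0.0, 0.0), 2, prototypes, labels)
```

The ratio is at most one, so the logarithm is never positive; it approaches zero when the mixture of the correct class explains the sample almost entirely.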


2.4 Distance Learning 

The original LVQ, GLVQ and RSLVQ methods rely on the Euclidean distance. This
choice of metric assumes a spherical receptive field of prototypes,
as shown in Fig. 2(a). However, for heterogeneous datasets, there can be
different scaling and correlations of the dimensions, and for high-dimensional
datasets the estimation errors accumulate and can disrupt the classification
m-. In distance learning the distance measure is adapted during training, which
allows us to obtain receptive fields for prototypes such as those shown in
Figs. 2(b) and 2(c). In the generalized matrix LVQ (GMLVQ) [501 ED a generalized
distance metric is proposed as

$$ d^{\Lambda}(w, x) = (x - w)^T \Lambda (x - w), \tag{9} $$

where $\Lambda$ is a full $D \times D$ matrix. To get a valid metric, $\Lambda$
must be positive (semi-)definite. This is achieved by substituting

$$ \Lambda = \Omega^T \Omega, \tag{10} $$

which yields $u^T \Lambda u = u^T \Omega^T \Omega u = (\Omega u)^2 \geq 0$ for all
$u$, where $\Omega$ is a real matrix with $D$ columns.
The receptive field of prototype $w_i$ becomes

$$ R^i_{\Lambda} = \{x \in X \mid \forall w_j \, (j \neq i) \rightarrow d^{\Lambda}(w_i, x) < d^{\Lambda}(w_j, x)\}. \tag{11} $$

In this way arbitrary Euclidean metrics can be realized; in particular,
correlations of dimensions and rotations of axes can be taken into account. If
$\Lambda$ is restricted to being diagonal then Eq. (9) is reduced to

$$ d^{\lambda}(x, w) = \|x - w\|^2_{\lambda} = \sum_{j=1}^{D} \lambda_j (x_j - w_j)^2. \tag{12} $$

This simplification gives rise to what are called Generalized Relevance
approaches to LVQ (GRLVQ) |T511501155].
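
The adaptive distances of Eqs. (9), (10) and (12) can be sketched in a few lines. The function names and the example matrices are assumptions for illustration only. Computing $\Omega(x - w)$ first and then its squared norm evaluates $(x - w)^T \Omega^T \Omega (x - w)$ without forming $\Lambda$ explicitly.

```python
def matvec(matrix, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in matrix]


def matrix_distance(x, w, omega):
    """d^Lambda(w, x) of Eq. (9) with Lambda = Omega^T Omega, Eq. (10)."""
    diff = [a - b for a, b in zip(x, w)]
    projected = matvec(omega, diff)
    return sum(p * p for p in projected)


def relevance_distance(x, w, lam):
    """d^lambda(x, w) of Eq. (12): diagonal Lambda with relevance factors."""
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, x, w))


x, w = (1.0, 2.0), (0.0, 0.0)
identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the squared Euclidean distance
shear = [[1.0, 1.0], [0.0, 1.0]]      # hypothetical Omega coupling both dimensions
d_euclid = matrix_distance(x, w, identity)
d_shear = matrix_distance(x, w, shear)
d_relevance = relevance_distance(x, w, (0.25, 0.75))
```

With the identity matrix the spherical receptive field of Fig. 2(a) is recovered, while a diagonal or full matrix corresponds to the cases of Figs. 2(b) and 2(c).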


2.5 Kernelization 

LVQ classifiers have been kernelized [52], [59]. A mapping function
$\Phi(\cdot)$ is defined in order to realize a nonlinear transformation from the
data space $\mathbb{R}^D$ to a higher-dimensional, possibly linearly separable,
feature space $F$, as follows:

$$ \Phi : \mathbb{R}^D \rightarrow F, \quad x \mapsto \Phi(x). \tag{13} $$

A kernel function [55] can be represented as a dot product and is usually
chosen as a Gaussian kernel:

$$ k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma_k^2} \right). \tag{14} $$

The LVQ classifiers can then be applied in feature space, where the
prototypes are represented in an implicit form

$$ w_j^F = \sum_{m=1}^{N} \gamma_{jm} \Phi(x_m), \tag{15} $$




David Nova, Pablo A. Estevez 


where $\gamma_j \in \mathbb{R}^N$ are the combinatorial coefficient vectors. The
distance in feature space between an image $\Phi(x_i)$ and a prototype vector
$w_j^F$ can be directly computed using kernels:

$$
\begin{aligned}
d^F(x_i, w_j) &= \|\Phi(x_i) - w_j^F\|^2 = \Big\| \Phi(x_i) - \sum_{m=1}^{N} \gamma_{jm} \Phi(x_m) \Big\|^2 \\
&= k(x_i, x_i) - 2 \sum_{m=1}^{N} \gamma_{jm} k(x_i, x_m) + \sum_{s=1}^{N} \sum_{t=1}^{N} \gamma_{js} \gamma_{jt} k(x_s, x_t).
\end{aligned}
\tag{16}
$$

This is another form of the kernel trick, in which there is no need to know
the non-linear mapping $\Phi(\cdot)$. This approach is called kernel GLVQ (KGLVQ)
m-.
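
The kernel expansion of Eq. (16) can be checked numerically with a short sketch. The function names, the data and the kernel width are hypothetical; the Gaussian kernel is written in a common parameterization with width `sigma_k`, which is an assumption of this example. With a linear kernel the expression reduces to the squared Euclidean distance to the explicit prototype, which gives a simple sanity check.

```python
import math


def gaussian_kernel(a, b, sigma_k=1.0):
    """Gaussian kernel with an assumed width parameter sigma_k."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * sigma_k ** 2))


def linear_kernel(a, b):
    """Plain dot product, used here only to sanity-check the expansion."""
    return sum(p * q for p, q in zip(a, b))


def feature_space_distance(i, gamma_j, data, kernel=gaussian_kernel):
    """Squared distance of Eq. (16) between Phi(x_i) and the prototype
    w_j^F = sum_m gamma_jm Phi(x_m), computed with kernel evaluations only."""
    n = len(data)
    k_ii = kernel(data[i], data[i])
    cross = sum(gamma_j[m] * kernel(data[i], data[m]) for m in range(n))
    proto = sum(gamma_j[s] * gamma_j[t] * kernel(data[s], data[t])
                for s in range(n) for t in range(n))
    return k_ii - 2.0 * cross + proto


data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
gamma = [0.0, 0.5, 0.5]   # implicit prototype: midpoint of samples 1 and 2
d_linear = feature_space_distance(0, gamma, data, kernel=linear_kernel)
d_gauss = feature_space_distance(0, [0.0, 1.0, 0.0], data)
```

In the linear case the implicit prototype is the point (1, 1), so the distance from the origin is 2, as expected.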


2.6 Dis-/similarities 

Some problems do not allow a vectorial representation, e.g. alignment of
symbolic strings. In these cases the data can be represented by pairwise
similarities $s_{ij} = s(x_i, x_j)$ or dissimilarities $d_{ij} = d(x_i, x_j)$ and
their corresponding matrices $S$ and $D$, respectively. These matrices are
symmetric, i.e. $S = S^T$ and $D = D^T$, with zero diagonals. It is easy to turn
similarities into dissimilarities and vice-versa, as shown in [IB]. This
corresponds to what is called a ’relational data’ representation. Data
represented by pairwise dissimilarities with the previously mentioned
restrictions can always be embedded in a pseudo-Euclidean space [22], [28], [48].
A pseudo-Euclidean space is a vector space equipped with a symmetric bilinear
form composed of two parts: a Euclidean one (positive eigenvalues), and a
correction (negative eigenvalues) [38]. The bilinear form is expressed as

$$ \langle x, y \rangle_{p,q} = x^T I_{p,q} \, y, \tag{17} $$

where $I_{p,q}$ is a diagonal matrix with $p$ entries $1$ and $q$ entries $-1$.
This pseudo-Euclidean space is characterized by the signature
$(p, q, N - p - q)$, where the first $p$ components are Euclidean, the next $q$
components are non-Euclidean, and $N - p - q$ components are zeros. The
prototypes are assumed to be linear combinations of data samples:

$$ w_j = \sum_{m} \alpha_{jm} x_m, \quad \text{with} \quad \sum_{m} \alpha_{jm} = 1, \tag{18} $$

where $\alpha_j = (\alpha_{j1}, \dots, \alpha_{jN})$ is the vector of coefficients
that describes the prototype $w_j$ implicitly. Unlike kernel approaches,
dis-/similarities approaches do not assume that the data is Euclidean. The
distances between all data points and prototypes are computed based on pairwise
data similarities as follows:

$$ d^D(x_i, w_j) = \|x_i - w_j\|^2 = [D \cdot \alpha_j]_i - \frac{1}{2} \alpha_j^T D \alpha_j. \tag{19} $$
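
Eq. (19) can be verified on a small Euclidean example, where the relational distance must coincide with the squared distance to the explicit prototype. The following is a sketch with hypothetical names and data; the dissimilarity matrix holds squared distances between three points on a line, and the coefficients sum to one as required by Eq. (18).

```python
def relational_distance(i, alpha_j, D):
    """d^D(x_i, w_j) of Eq. (19): [D alpha_j]_i - 0.5 * alpha_j^T D alpha_j."""
    n = len(D)
    d_alpha = [sum(D[r][m] * alpha_j[m] for m in range(n)) for r in range(n)]
    quadratic = sum(alpha_j[r] * d_alpha[r] for r in range(n))
    return d_alpha[i] - 0.5 * quadratic


# Hypothetical data: points 0, 2 and 5 on a line, squared dissimilarities.
points = [0.0, 2.0, 5.0]
D = [[(a - b) ** 2 for b in points] for a in points]
alpha = [0.5, 0.5, 0.0]   # implicit prototype at position 1.0
d0 = relational_distance(0, alpha, D)
d2 = relational_distance(2, alpha, D)
```

Only the matrix `D` and the coefficient vector are needed; the coordinates in `points` are used here solely to build `D` and to check the result.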

Fig. 3 Schematic of the LVQ2.1 updating rule. The circles represent the prototypes; the
square represents an input sample; and the colors indicate different class labels. In this 
example, the sample is incorrectly classified and the nearest prototype of the correct class 
moves towards the sample, while the nearest prototype of the incorrect class moves away 
from the sample. 


3 LVQ Learning Rules 


3.1 LVQ 2.1 


Among the initial variants proposed by Kohonen, the most popular is LVQ2.1
m, which is described in detail below. As in LVQ2, two prototypes are
updated at each step, the winner and the runner-up. One of them, $w^+$, belongs
to the correct class, while the other, $w^-$, belongs to the incorrect class. In
LVQ2.1, however, either the winner or the runner-up may be the prototype with
the same class label as the sample. Furthermore, the current input must fall
within a window defined around the mid-plane of vectors $w^+$ and $w^-$ (see
Fig. 3). The updating rule is as follows:

$$
\begin{aligned}
w^+(t+1) &= w^+(t) + \epsilon(t) \cdot (x_i - w^+(t)), \quad \text{if } c(w^+) = y, \\
w^-(t+1) &= w^-(t) - \epsilon(t) \cdot (x_i - w^-(t)), \quad \text{if } c(w^-) \neq y,
\end{aligned}
\tag{20}
$$

where $w^+$ ($w^-$) is the closest prototype to the input sample $x_i$ with the
same (a different) class label as the sample class label $y$, and
$\epsilon \in ]0,1[$ is the learning rate. The prototypes, however, are changed
only if the data point $x$ is close to the classification boundary, i.e., if it
lands within a window of relative width $\omega$ defined by

$$ \min\left( \frac{d(x, w^-)}{d(x, w^+)}, \frac{d(x, w^+)}{d(x, w^-)} \right) > \frac{1 - \omega}{1 + \omega}, \tag{21} $$

where $d(\cdot)$ is the Euclidean distance measure and $\omega \in ]0,1[$
(typically set to $\omega = 0.2$ or $\omega = 0.3$). A didactic scheme of the
LVQ2.1 update learning rule is illustrated in Fig. 3.
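
A single LVQ2.1 update can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name, default parameter values and toy data are hypothetical, the window test is taken as a strict inequality, and the update is skipped when the two nearest prototypes do not contain exactly one prototype of the correct class.

```python
import math


def euclidean(x, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))


def lvq21_step(x, y, prototypes, labels, eps=0.1, omega=0.3):
    """Return a copy of the prototypes after one LVQ2.1 update for (x, y)."""
    updated = [list(w) for w in prototypes]
    order = sorted(range(len(prototypes)), key=lambda j: euclidean(x, prototypes[j]))
    a, b = order[0], order[1]               # winner and runner-up
    if (labels[a] == y) == (labels[b] == y):
        return updated                      # need one correct and one incorrect
    plus, minus = (a, b) if labels[a] == y else (b, a)
    d_plus = euclidean(x, prototypes[plus])
    d_minus = euclidean(x, prototypes[minus])
    if d_plus == 0.0 or d_minus == 0.0:
        return updated                      # ratio undefined, skip the update
    s = (1.0 - omega) / (1.0 + omega)
    if min(d_minus / d_plus, d_plus / d_minus) <= s:
        return updated                      # sample lies outside the window
    # Attract the correct prototype, repel the incorrect one (update rule above).
    updated[plus] = [wi + eps * (xi - wi) for xi, wi in zip(x, prototypes[plus])]
    updated[minus] = [wi - eps * (xi - wi) for xi, wi in zip(x, prototypes[minus])]
    return updated


prototypes = [[0.0, 0.0], [1.0, 0.0]]
labels = [1, 2]
inside = lvq21_step((0.45, 0.0), 1, prototypes, labels)
outside = lvq21_step((0.1, 0.0), 1, prototypes, labels)
```

The first sample lies near the mid-plane and triggers an update; the second is far from the border, so the prototypes are left unchanged.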









Fig. 4 Schematic of the GLVQ learning rule which analyzes the behavior of the two closest 
prototypes belonging to different classes. The circles represent the prototypes; the square 
represents the input sample; and the colors indicate the different class labels, (a) Rule 
behavior when an input sample with a red class label is presented, (b) Rule behavior when 
an input sample with a blue class label is presented. 


3.2 LVQ Methods Based on Margin Maximization 

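Using stochastic gradient descent to minimize Eq. (2), the following learning
rules for GLVQ are obtained:

$$
\begin{aligned}
w^+(t+1) &= w^+(t) + 2 \cdot \epsilon \cdot \phi' \cdot \mu^+ \cdot (x_i - w^+), \\
w^-(t+1) &= w^-(t) - 2 \cdot \epsilon \cdot \phi' \cdot \mu^- \cdot (x_i - w^-),
\end{aligned}
\tag{22}
$$

where $\mu^+ = \frac{2 d^-}{(d^+ + d^-)^2}$, $\mu^- = \frac{2 d^+}{(d^+ + d^-)^2}$,
and $\epsilon \in ]0,1[$ is the learning rate. A
didactic scheme of the GLVQ learning rule is shown in Fig. 4.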
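
The GLVQ learning rule can be illustrated with a short sketch of one stochastic update. The function name, learning rate and data are hypothetical, and the squared Euclidean distance is assumed; the factor $\phi'$ is evaluated as $\phi(\mu)(1 - \phi(\mu))$, the derivative of the logistic sigmoid.

```python
import math


def sq_dist(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))


def glvq_step(x, y, prototypes, labels, eps=0.05):
    """Return a copy of the prototypes after one GLVQ update for (x, y)."""
    d = [sq_dist(x, w) for w in prototypes]
    plus = min((j for j, c in enumerate(labels) if c == y), key=lambda j: d[j])
    minus = min((j for j, c in enumerate(labels) if c != y), key=lambda j: d[j])
    d_plus, d_minus = d[plus], d[minus]
    mu = (d_plus - d_minus) / (d_plus + d_minus)
    phi = 1.0 / (1.0 + math.exp(-mu))
    dphi = phi * (1.0 - phi)                        # derivative of the sigmoid
    mu_plus = 2.0 * d_minus / (d_plus + d_minus) ** 2
    mu_minus = 2.0 * d_plus / (d_plus + d_minus) ** 2
    updated = [list(w) for w in prototypes]
    updated[plus] = [wi + 2.0 * eps * dphi * mu_plus * (xi - wi)
                     for xi, wi in zip(x, prototypes[plus])]
    updated[minus] = [wi - 2.0 * eps * dphi * mu_minus * (xi - wi)
                      for xi, wi in zip(x, prototypes[minus])]
    return updated


prototypes = [[1.0, 0.0], [0.0, 2.0]]
labels = [1, 2]
new_prototypes = glvq_step((0.0, 0.0), 1, prototypes, labels)
```

Unlike LVQ2.1 there is no window: every sample contributes, with a step size weighted by $\phi'$ and by the distance ratios $\mu^+$ and $\mu^-$.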

Early LVQ versions assumed a fixed number of codebook vectors per class,
and their initial values were set using an ad-hoc method. In order to overcome
these problems, alternative supervised versions of Neural Gas (NG) [32] and
Growing Neural Gas (GNG) jinj have been developed. These extensions have
been called Supervised Neural Gas (SNG) [31] and Supervised Growing Neural
Gas (SGNG) [31].

SNG adds neighborhood cooperativity to GLVQ, which removes the dependency
on the initialization. All prototypes $w_j$ having the same class as the current
input sample $x_i$ are adapted towards the data point according to their
ranking, and the closest prototype with a different class label is moved away
from the data sample. SGNG incorporates plasticity by selecting an adaptive
number of prototypes during learning. Another method, the Harmonic to Minimum
LVQ (H2MLVQ), allows adapting all prototypes simultaneously in each
iteration. All prototypes having the same class label as the current sample
are updated. Likewise, all prototypes with a different label from that of the
sample are adjusted too; here $d^+$ and $d^-$ in Eq. (3) are replaced by
harmonic average distances m-. For more details the reader is referred to [54].
The previously mentioned methods tackle the initialization sensitivity problem
associated with the initial position of the prototypes.

Other margin maximization methods are based on Information Theoretic
Learning (ITL) |d9l I?)H1 [H7]. These methods use a new divergence measure
proposed by Principe et al. [51], based on the Cauchy-Schwarz inequality.
Together with a consistently chosen Parzen estimator for the densities, it gives
a numerically well-behaved approach to information-optimizing, prototype-based
vector quantization called Cauchy-Schwarz Divergence LVQ (CSDLVQ)[Sni [7D].
These methods use fuzzy class labels to train the classifier.

3.2.1 Distance Learning Approaches 

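In the case of GRLVQ [TH], the relevance factors can be determined by gradient
descent, as follows:

$$ \lambda_m(t+1) = \lambda_m(t) - \epsilon \phi' \cdot \left( \mu^+ (x_m - w_m^+)^2 - \mu^- (x_m - w_m^-)^2 \right). \tag{23} $$

The adaptation formulas are obtained by computing the derivatives with
respect to $w$ and $\lambda$ using the following relative distance difference:

$$ \mu(x_i, W, \lambda) = \frac{d^{\lambda}(x_i, w^+) - d^{\lambda}(x_i, w^-)}{d^{\lambda}(x_i, w^+) + d^{\lambda}(x_i, w^-)}, \tag{24} $$

with $\mu^+ = \frac{2 d^{\lambda}(x_i, w^-)}{(d^{\lambda}(x_i, w^+) + d^{\lambda}(x_i, w^-))^2}$
and $\mu^- = \frac{2 d^{\lambda}(x_i, w^+)}{(d^{\lambda}(x_i, w^+) + d^{\lambda}(x_i, w^-))^2}$,
where $d^{\lambda}$ was defined in Eq. (12). The updating rules for $w$ can be
obtained from Eq. (22) by replacing $\mu^+$ and $\mu^-$ with those obtained from
Eq. (24) above.

Analogously, the relative distance difference for GMLVQ is as follows:

$$ \mu(x_i, W, \Lambda) = \frac{d^{\Lambda}(x_i, w^+) - d^{\Lambda}(x_i, w^-)}{d^{\Lambda}(x_i, w^+) + d^{\Lambda}(x_i, w^-)}. \tag{25} $$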


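The update rules for GMLVQ are obtained by minimizing the cost function
defined in Eq. (2) with the generalized distance metric defined in Eq. (9). The
update rules for $w$ and $\Omega$ are as follows:

$$
\begin{aligned}
w^+(t+1) &= w^+(t) + \epsilon \cdot 2 \cdot \phi' \cdot \mu^+ \cdot \Lambda \cdot (x - w^+(t)), \\
w^-(t+1) &= w^-(t) - \epsilon \cdot 2 \cdot \phi' \cdot \mu^- \cdot \Lambda \cdot (x - w^-(t)),
\end{aligned}
\tag{26}
$$

and

$$
\begin{aligned}
\Omega_{lm}(t+1) = \Omega_{lm}(t) - \epsilon \cdot 2 \cdot \phi' \cdot \big( & \mu^+ \cdot ((x_m - w_m^+)[\Omega(x - w^+)]_l) \\
- \, & \mu^- \cdot ((x_m - w_m^-)[\Omega(x - w^-)]_l) \big),
\end{aligned}
\tag{27}
$$

where $l$ and $m$ specify matrix components,
$\mu^+ = \frac{2 d^{\Lambda}(x_i, w^-)}{(d^{\Lambda}(x_i, w^+) + d^{\Lambda}(x_i, w^-))^2}$
and
$\mu^- = \frac{2 d^{\Lambda}(x_i, w^+)}{(d^{\Lambda}(x_i, w^+) + d^{\Lambda}(x_i, w^-))^2}$,
and the distance $d^{\Lambda}(\cdot)$ is described in Eq. (9).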
















3.2.2 Kernel Approaches

An extension of GLVQ is the Kernel GLVQ (KGLVQ) HU. KGLVQ projects
the data space non-linearly into a higher-dimensional feature space, and then
applies the GLVQ algorithm in this newly formed feature space. The prototypes
are represented in an implicit form as shown by Eq. (15). The updating
rules of the GLVQ algorithm in Eq. (22) can be generalized from the data space
into the feature space $F$ by using the following relative distance difference:

$$ \mu(\Phi(x_i), W) = \frac{d^F(\Phi(x_i), w^+) - d^F(\Phi(x_i), w^-)}{d^F(\Phi(x_i), w^+) + d^F(\Phi(x_i), w^-)}, \tag{28} $$

and its derivatives
$\mu^+ = \frac{2 d^F(x_i, w^-)}{(d^F(x_i, w^+) + d^F(x_i, w^-))^2}$ and
$\mu^- = \frac{2 d^F(x_i, w^+)}{(d^F(x_i, w^+) + d^F(x_i, w^-))^2}$,
where the metric $d^F$ is defined in Eq. (16).




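The equivalent updating rules in terms of adjusting the parameters $\gamma$ in
Eq. (15) are:

$$
\begin{aligned}
\gamma_s^+(t+1) &=
\begin{cases}
[1 - \epsilon \cdot \phi' \cdot \mu^+] \cdot \gamma_s^+(t) & \text{if } x_s \neq x_i, \\
[1 - \epsilon \cdot \phi' \cdot \mu^+] \cdot \gamma_s^+(t) + \epsilon \cdot \phi' \cdot \mu^+ & \text{if } x_s = x_i,
\end{cases} \\
\gamma_s^-(t+1) &=
\begin{cases}
[1 + \epsilon \cdot \phi' \cdot \mu^-] \cdot \gamma_s^-(t) & \text{if } x_s \neq x_i, \\
[1 + \epsilon \cdot \phi' \cdot \mu^-] \cdot \gamma_s^-(t) - \epsilon \cdot \phi' \cdot \mu^- & \text{if } x_s = x_i,
\end{cases}
\end{aligned}
\tag{29}
$$

where $s$ indexes the components of the combinatorial coefficient vectors
$\gamma$, $\gamma^+$ is the coefficient vector associated with the nearest
prototype $w^+$ having the same class label as $x_i$, and $\gamma^-$ is the
coefficient vector associated with the nearest prototype $w^-$ having a
different class label from $x_i$.

Furthermore, in m KGLVQ is extended to obtain the Nyström-Approximation
Generalized LVQ (AKGLVQ), where sparsity is imposed and a Nyström
approximation technique is used to reduce the learning complexity in large data
sets.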

3.2.3 Dis-/similarities Approaches 


In HU a new method was proposed using dis-/similarity data by representing
it in an implicit form. This approach is extended to the margin maximization
technique, giving rise to the relational GLVQ (RGLVQ).

An implicit representation of the prototypes is assumed, as described by
Eq. (18). Once again the starting point is the GLVQ updating rule
(Eq. (22)), where the relative distance difference is now computed as

$$ \mu(x_i, W, D) = \frac{d^D(x_i, w^+) - d^D(x_i, w^-)}{d^D(x_i, w^+) + d^D(x_i, w^-)}, \tag{30} $$

where $d^D(\cdot)$ is the distance defined in Eq. (19), and
$\mu^+ = \frac{2 d^D(x_i, w^-)}{(d^D(x_i, w^+) + d^D(x_i, w^-))^2}$ and
$\mu^- = \frac{2 d^D(x_i, w^+)}{(d^D(x_i, w^+) + d^D(x_i, w^-))^2}$
are the magnitudes of the respective partial derivatives of the relative
distance difference. The $d_{ij}$ are dissimilarity matrix elements of
$D \in \mathbb{R}^{N \times N}$, and $\alpha_m^+$ and $\alpha_m^-$ are the
coefficient vector elements associated with the nearest prototype $w^+$ having
the same class label as $x_i$, and the nearest prototype $w^-$ having a
different class label from $x_i$, respectively. The equivalent updating rules
for the $\alpha$ parameters are:

$$
\begin{aligned}
\alpha_m^+(t+1) &= \alpha_m^+(t) - \epsilon \cdot \phi' \cdot \mu^+ \cdot \Big( d_{im} - \sum_{l} d_{lm} \alpha_l^+ \Big), \\
\alpha_m^-(t+1) &= \alpha_m^-(t) + \epsilon \cdot \phi' \cdot \mu^- \cdot \Big( d_{im} - \sum_{l} d_{lm} \alpha_l^- \Big).
\end{aligned}
\tag{31}
$$

I 


3.3 Methods Based on Likelihood Ratio Maximization 


In RSLVQ [63], [64], the updating rule for the prototypes is derived from the
likelihood ratio in Eq. (5) by maximizing this cost function through gradient
ascent,

$$ w_j(t+1) = w_j(t) + \epsilon(t) \frac{\partial}{\partial w_j} \log \left( \frac{p(x_i, y_i \mid W)}{p(x_i \mid W)} \right). \tag{32} $$

Eq. (32) can be re-written as follows:

$$
w_j(t+1) = w_j(t) + \frac{\epsilon}{\sigma^2}
\begin{cases}
(P_y(j \mid x) - P(j \mid x))(x - w_j), & \text{if } c(w_j) = y, \\
-P(j \mid x)(x - w_j), & \text{if } c(w_j) \neq y,
\end{cases}
\tag{33}
$$

where $\epsilon \in ]0,1[$ is the learning rate, $y$ is the class label of the
data points generated by component $j$, and $P_y(j \mid x)$ and $P(j \mid x)$ are
assignment probabilities described as follows:

$$ P_y(j \mid x) = \frac{P(j) \exp(f(x, w_j, \sigma_j^2))}{\sum_{\{i : c(w_i) = y\}} P(i) \exp(f(x, w_i, \sigma_i^2))}, \tag{34} $$

$$ P(j \mid x) = \frac{P(j) \exp(f(x, w_j, \sigma_j^2))}{\sum_{i=1}^{M} P(i) \exp(f(x, w_i, \sigma_i^2))}. \tag{35} $$

Note that Eq. (33) provides a way to extend LVQ to other distance measures
by changing $f(x, w, \sigma^2)$ in Eqs. (34) and (35).
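
One RSLVQ update following Eqs. (33)-(35) can be sketched as below. The sketch assumes equal priors $P(j)$ and a single shared variance $\sigma^2$, so that the priors cancel in the assignment probabilities; the function name, parameter defaults and toy data are hypothetical, and the squared Euclidean distance is used in $f$. Numerical underflow of the exponentials for very distant samples is not handled.

```python
import math


def rslvq_step(x, y, prototypes, labels, eps=0.1, sigma2=1.0):
    """Return a copy of the prototypes after one RSLVQ update for (x, y)."""
    # Unnormalised Gaussian responsibilities exp(f(x, w_j, sigma^2)).
    g = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, w)) / (2.0 * sigma2))
         for w in prototypes]
    total = sum(g)
    correct = sum(gj for gj, c in zip(g, labels) if c == y)
    updated = []
    for w, c, gj in zip(prototypes, labels, g):
        p_j = gj / total                    # P(j | x), Eq. (35)
        if c == y:
            p_y_j = gj / correct            # P_y(j | x), Eq. (34)
            coeff = p_y_j - p_j             # attract correct-class prototypes
        else:
            coeff = -p_j                    # repel wrong-class prototypes
        updated.append([wi + (eps / sigma2) * coeff * (xi - wi)
                        for xi, wi in zip(x, w)])
    return updated


prototypes = [[1.0, 0.0], [0.0, 2.0]]
labels = [1, 2]
new_prototypes = rslvq_step((0.0, 0.0), 1, prototypes, labels)
```

In contrast to LVQ2.1 and GLVQ, every prototype is updated, with a soft weight given by its assignment probability.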


3.3.1 Distance Learning Approaches

The matrix learning scheme was applied to RSLVQ [61], obtaining what was
named Matrix RSLVQ (MRSLVQ) and its local version, Local MRSLVQ
(LMRSLVQ). The latter consists of a local distance adaptation for each
prototype, which in this case is a full matrix of relevance factors. As it is
based on RSLVQ, this approach uses the conditional probability density function
$p(x \mid j)$ that component $j$ generates a particular data point $x$. It is a
function of prototype $w_j$ and it is assumed to have the normalized exponential
form $p(x \mid j) = K(j) \cdot \exp f(x, w_j, \sigma_j^2)$. In matrix learning,
Eq. (8) is substituted with Eq. (36), where the distance $d^{\Lambda}$, defined
in Eq. (9), is used:

$$ f(x, w_j, \sigma_j^2, \Lambda) = -\frac{d^{\Lambda}(x, w)}{2\sigma^2}. \tag{36} $$

The updating rules for $w$ and $\Omega$ are obtained in a way similar to RSLVQ
but using the adaptive distance, as follows:

$$
w_j(t+1) = w_j(t) + \frac{\epsilon}{\sigma^2}
\begin{cases}
(P_y(j \mid x) - P(j \mid x)) \Lambda (x - w_j), & \text{if } c(w_j) = y, \\
-P(j \mid x) \Lambda (x - w_j), & \text{if } c(w_j) \neq y,
\end{cases}
\tag{37}
$$

where $y$ is the class label of the data points generated by component $j$, and

$$
\begin{aligned}
\Omega_{lm}(t+1) = \Omega_{lm}(t) - \frac{\epsilon}{\sigma^2} \sum_{j} \Big[ & \big( \delta_{y, c(w_j)} (P_y(j \mid x) - P(j \mid x)) - (1 - \delta_{y, c(w_j)}) P(j \mid x) \big) \\
& \cdot \big( [\Omega(x - w_j)]_l (x_m - w_{j,m}) \big) \Big],
\end{aligned}
\tag{38}
$$

where $l$ and $m$ index matrix elements. This approach can be extended to a
local matrix for every prototype as proposed in m-.


3.3.2 Kernel Approaches

LVQ methods based on likelihood ratio maximization have been extended to
feature space by using kernels. A kernelized version of RSLVQ was proposed in
m and called Kernel Robust Soft Learning Vector Quantization (KRSLVQ).
The prototypes are represented as a linear combination of data images in the
feature space as described in Eq. (15). As in RSLVQ, a Gaussian mixture
model is assumed, but the Euclidean distance is measured in feature space,
using the distance $d^F$ defined in Eq. (16). Consequently Eq. (8) is
substituted with

$$ f(\Phi(x), w_j^F, \sigma^2) = -\frac{d^F(\Phi(x), w_j^F)}{2\sigma^2}. \tag{39} $$

The updating rules of KRSLVQ are

$$
\gamma_{jm}(t+1) = \gamma_{jm}(t) + \frac{\epsilon}{\sigma^2}
\begin{cases}
-(P_y(j \mid \Phi(x_i)) - P(j \mid \Phi(x_i))) \, \gamma_{jm} & \text{if } x_m \neq x_i, \; c(w_j^F) = y, \\
(P_y(j \mid \Phi(x_i)) - P(j \mid \Phi(x_i)))(1 - \gamma_{jm}) & \text{if } x_m = x_i, \; c(w_j^F) = y, \\
P(j \mid \Phi(x_i)) \, \gamma_{jm} & \text{if } x_m \neq x_i, \; c(w_j^F) \neq y, \\
-P(j \mid \Phi(x_i))(1 - \gamma_{jm}) & \text{if } x_m = x_i, \; c(w_j^F) \neq y.
\end{cases}
\tag{40}
$$

3.3.3 Dis-/similarities Approaches

Dis-/similarities approaches allow extending the likelihood-ratio maximization
LVQ methods to a pseudo-Euclidean space, giving rise to relational RSLVQ
[28]. The prototypes are represented in an implicit form by a linear combination
of the data points as in Eq. (18). Recall that $D$ represents a dissimilarity
matrix which is symmetric and has diagonal elements $d_{ii} = 0$. The
dissimilarities are computed by using Eq. (19). In this approach the argument of
the normalized exponential takes the following form:

$$ f(x_i, w_j, \sigma^2) = -\frac{d^D(x_i, w_j)}{2\sigma^2} = -\frac{[D \cdot \alpha_j]_i - \frac{1}{2} \cdot \alpha_j^T D \alpha_j}{2\sigma^2}. \tag{41} $$

Using stochastic gradient ascent the following updating rules are obtained:

$$
\alpha_{jm}(t+1) = \alpha_{jm}(t) - \frac{\epsilon}{2\sigma^2}
\begin{cases}
\left( \dfrac{p(x_i \mid j)}{\sum_{j' : c(w_{j'}) = y} p(x_i \mid j')} - \dfrac{p(x_i \mid j)}{\sum_{j'} p(x_i \mid j')} \right) \cdot \Big[ d_{im} - \sum_{l} d_{lm} \alpha_{jl} \Big] & \text{if } c(w_j) = y, \\[2ex]
- \dfrac{p(x_i \mid j)}{\sum_{j'} p(x_i \mid j')} \cdot \Big[ d_{im} - \sum_{l} d_{lm} \alpha_{jl} \Big] & \text{if } c(w_j) \neq y,
\end{cases}
\tag{42}
$$

where $p(x_i \mid j) = K \cdot \exp \left( -\frac{[D \cdot \alpha_j]_i - \frac{1}{2} \cdot \alpha_j^T D \alpha_j}{2\sigma^2} \right)$
and $\alpha_j$ is a coefficient vector which implicitly describes $w_j$ through
Eq. (18).

4 Results 

In this section the results of three different experiments are presented. The 
first one employs the Multi-modal dataset which is used for studying 
the sensitivity to the initialization or initial position of the prototypes. The 
second dataset is Image Segmentation m, which is used for comparing the 
performance of the different methods when the nature of the features in the 
data is heterogeneous. Finally, the third dataset is USPS j^S], which allows us 
to compare the performance of the LVQ classifiers of the different methods in 
a real-world problem. 

The multi-modal dataset [S3] has three classes Cl, C2 and C3, with 1200 
training samples per class. The training samples in class C3 are distributed in 
three clusters, while those in classes Cl and C2 have multi-modal distributions. 
Class Cl consists of 15 sub-clusters and the number of samples per cluster are 
50, 50, 50, 50, 50, 50, 50, 50, 50, 150, 150, 150, 100, 100 and 100, respectively. 
Class C2 is composed of 12 sub-clusters and the number of samples per cluster 
are 100, 100, 100, 50, 50, 50, 50, 50, 50, 200, 200 and 200, respectively. 

The USPS dataset consists of 9298 images of handwritten digits 0-9 (10
classes) of 16x16 pixels in gray scale, which are split into 7291 training set
images and 2007 test set images [25]. In this experiment we used the original
dataset and also a subset of 2000 images, which is named USPS*. This smaller
dataset is used for comparison purposes with other works published in the
literature.

Table 1 Summary of 11 LVQ Classifiers

| Name | Characteristics | Parameters | Distance | Constraint |
|---|---|---|---|---|
| LVQ 2.1 | Heuristic | $\{\epsilon, \omega, W\}$ | Euclidean distance $d(x, w) = \|x - w\|$ | Window rule on $\min\left(\frac{d(x, w^-)}{d(x, w^+)}, \frac{d(x, w^+)}{d(x, w^-)}\right)$, Eq. (21) |
| GLVQ | Margin maximization | $\{\epsilon, W\}$ | Euclidean distance $d(x, w) = \|x - w\|$ | no |
| RSLVQ | Likelihood ratio maximization | $\{\epsilon, \sigma, W\}$ | Euclidean distance $d(x, w) = \|x - w\|$ | no |
| SNG | Margin maximization | $\{\epsilon, W\}$ | Euclidean distance $d(x, w) = \|x - w\|$ | no |
| SGNG | Margin maximization | $\{\epsilon, W\}$ | Euclidean distance $d(x, w) = \|x - w\|$ | no |
| H2MLVQ | Margin maximization | $\{\epsilon, W\}$ | Harmonic to minimum distance | no |
| GRLVQ | Margin maximization, gradient descent | $\{\epsilon, W\}$ | Adaptive distance $d^{\lambda}(x, w) = \sum_i \lambda_i (x_i - w_i)^2$ | $\sum_i \lambda_i = 1$ |
| GMLVQ | Margin maximization | $\{\epsilon, \Omega, W\}$ | Adaptive distance $d^{\Lambda}(x, w) = (x - w)^T \Lambda (x - w)$ | Matrix $\Lambda = \Omega^T \Omega$ with $\sum_i \Lambda_{ii} = 1$ |
| LGRLVQ | Margin maximization, gradient descent | $\{\epsilon, W\}$ | Adaptive distance $d^{\lambda}(x, w) = \sum_i \lambda_i (x_i - w_i)^2$ | $\sum_i \lambda_i = 1$ |
| LGMLVQ | Margin maximization | $\{\epsilon, \Omega, W\}$ | Adaptive distance $d^{\Lambda}(x, w) = (x - w)^T \Lambda (x - w)$ | Matrix $\Lambda = \Omega^T \Omega$ with $\sum_i \Lambda_{ii} = 1$ |
| KRSLVQ | Likelihood ratio maximization, kernelized | $\{\epsilon, \sigma_k, \sigma, \gamma\}$ | Euclidean distance in feature space $F$, $d(\Phi(x), w^F) = \|\Phi(x) - w^F\|$ | no |

The Image Segmentation dataset consists of 2100 samples having 12 features
which correspond to 3x3 pixel regions extracted from outdoor images.
There are 7 classes: brick-face, sky, foliage, cement, window, path
and grass. Because features 3-5 are constant, they were eliminated.

10-fold cross validation was used for comparing the performance of the
different LVQ algorithms. In addition, a multi-comparison statistical test was
used to compare the means of all pairs of LVQ classifiers. This test involves
comparing many group means: simple pairwise t-tests are carried out and a
Bonferroni adjustment is then applied to the critical value to compensate for
the multiple comparisons [26] [55].
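
The size of the Bonferroni adjustment can be made concrete with a small sketch. It assumes a family-wise significance level of 0.05 (matching the 95% confidence level quoted in the results) and that all pairs among the 11 classifiers are compared; the variable names are illustrative and the sketch does not reproduce the statistical software actually used.

```python
from itertools import combinations

classifiers = ["LVQ2.1", "GLVQ", "RSLVQ", "SNG", "SGNG", "H2MLVQ",
               "GRLVQ", "GMLVQ", "LGRLVQ", "LGMLVQ", "KRSLVQ"]

# Every unordered pair of classifiers is one t-test.
pairs = list(combinations(classifiers, 2))

family_alpha = 0.05  # assumed family-wise significance level
per_test_alpha = family_alpha / len(pairs)  # Bonferroni-adjusted level

print(len(pairs))                 # 55
print(round(per_test_alpha, 6))   # 0.000909
```

With 55 pairwise comparisons, each individual t-test must therefore meet a much stricter threshold than 0.05 before a difference is declared significant.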

For all LVQ classifiers the following learning rate was used:

$$
\epsilon(t) = \frac{\epsilon_0}{1 + \tau \, (t - t_0)}, \qquad (43)
$$

where $\epsilon_0$ is the initial value of the learning rate, $\tau = 0.0001$ for Multi-modal,
$\tau = 0.001$ for the Image Segmentation dataset, and $\tau = 0.001$ for USPS; and
$t_0$ is the start time for the learning rate, with $T_{max} = 2000$ as the maximum
number of training epochs for all experiments.
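
The decay of the learning rate over the 2000 epochs can be sketched as follows. The function name is hypothetical, and returning zero before the start time $t_0$ is an assumption made here for illustration; the text does not state how the rate behaves for $t < t_0$.

```python
def learning_rate(t, eps0, tau, t0=0):
    """Learning rate eps0 / (1 + tau * (t - t0)) at epoch t."""
    if t < t0:
        # Assumption: no adaptation before the start time t0.
        return 0.0
    return eps0 / (1.0 + tau * (t - t0))


# Initial rate 0.05 and start time 0, evaluated at the last epoch.
print(learning_rate(2000, eps0=0.05, tau=0.0001))  # Multi-modal
print(learning_rate(2000, eps0=0.05, tau=0.001))   # Image Segmentation, USPS
```

With $\tau = 0.0001$ the rate falls only to $0.05 / 1.2$ after 2000 epochs, whereas with $\tau = 0.001$ it falls to $0.05 / 3$, so the second setting anneals considerably faster.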

Table 1 shows a summary of 11 LVQ classifiers chosen for comparison. All 
of them are compared using the 3 datasets described above. 
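
The three distance families listed in Table 1 can be compared on a small example. The sketch below is illustrative only: the function names are hypothetical, the relevance vector and the matrix $\Omega$ are chosen by hand so that the normalization constraints hold, and no learning of the metric is performed.

```python
import math


def euclidean(x, w):
    """Euclidean distance ||x - w||."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))


def relevance_distance(x, w, lam):
    """Relevance-weighted distance sum_i lam_i * (x_i - w_i)^2."""
    return sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))


def matrix_distance(x, w, omega):
    """Matrix distance (x - w)^T Lambda (x - w) with Lambda = Omega Omega^T.

    Computed as the squared norm of Omega^T (x - w).
    """
    diff = [xi - wi for xi, wi in zip(x, w)]
    n_cols = len(omega[0])
    proj = [sum(omega[r][c] * diff[r] for r in range(len(diff)))
            for c in range(n_cols)]
    return sum(p * p for p in proj)


x, w = [3.0, 4.0], [0.0, 0.0]
lam = [0.5, 0.5]                       # relevances sum to 1
s = 1.0 / math.sqrt(2.0)
omega = [[s, 0.0], [0.0, s]]           # Lambda = I / 2, so its trace is 1

print(euclidean(x, w))                 # 5.0
print(relevance_distance(x, w, lam))   # 12.5
print(matrix_distance(x, w, omega))    # approximately 12.5
```

With uniform relevances and a diagonal $\Omega$ the two adaptive distances agree, which reflects that the relevance distance is the special case of the matrix distance with a diagonal $\Lambda$.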
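Table 2 Summary of parameter values used for each LVQ classifier for the three datasets.

| Classifier | Multi-modal | Image Segmentation | USPS |
|---|---|---|---|
| LVQ2.1 | $\epsilon_0 = 0.05$, $t_0 = 0$, $s = 0.01$ | id | id |
| GLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$ | id | id |
| RSLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$, $\sigma^2_{opt} = 1.9858$ | $\sigma^2_{opt} = 0.01$ | $\sigma^2_{opt} = 0.01$ |
| SNG | $\epsilon_0 = 0.05$, $t_0 = 0$ | id | id |
| SGNG | $\epsilon_0 = 0.05$, $t_0 = 0$, $N_p^{max} = 45$ | $N_p^{max} = 10$ | $N_p^{max} = 30$ |
| H2MLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$ | id | id |
| GRLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$; $\epsilon_0 = 5 \cdot 10^{-?}$, $t_0 = 500$ | $t_0 = 100$ | $t_0 = 100$ |
| GMLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$; $\epsilon_0 = 5 \cdot 10^{-?}$, $t_0 = 500$; $\epsilon_0^* = 1 \cdot 10^{-?}$, $t_0^* = 500$ | $t_0 = 100$, $t_0^* = 100$ | $t_0 = 100$, $t_0^* = 100$ |
| LGRLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$; $\epsilon_0 = 5 \cdot 10^{-?}$, $t_0 = 100$ | id | id |
| LGMLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$; $\epsilon_0 = 1 \cdot 10^{-?}$, $t_0 = 100$; $\epsilon_0^* = 5 \cdot 10^{-?}$, $t_0^* = 100$ | id | id |
| KRSLVQ | $\epsilon_0 = 0.05$, $t_0 = 0$, $\sigma^2_{opt} = 1$ | $\sigma^2_{opt} = 0.01$ | $\sigma^2_{opt} = 0.5$ |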

Table 2 shows the parameter values used for the 11 LVQ classifiers for the 
three datasets. The expression “id” stands for equal values to those shown in 
the first column. 


4.1 Multi-modal dataset 

Table 3 (the second column) shows the results obtained for the multi-modal
dataset using 10-fold cross validation. The prototypes were initialized
randomly around the mean of the whole dataset. This setting allows quantifying the
sensitivity of each algorithm under study to the initial conditions. The number
of prototypes per class was set to $N_p = 15$. As shown in Table 3 (second
column), the algorithms that update more than one prototype per iteration,
such as H2M-LVQ, SNG and SGNG, obtained the best results together with
GLVQ. A multi-comparison statistical test was performed which showed that
H2M-LVQ is significantly better than 6 other algorithms (LVQ 2.1, RSLVQ,
GMLVQ, LGRLVQ, LGMLVQ, KRSLVQ) with a 95% confidence interval of
the mean difference. The statistical test indicates that there are 4 algorithms
that are not significantly different from H2M-LVQ: SGNG, SNG, GLVQ and
GRLVQ. However, if any of these 4 algorithms is taken as the reference, the
statistical test shows that it is significantly different from only
3 LVQ classifiers, instead of the 6 found for H2M-LVQ. The results in Table 3
show that the performance of LVQ 2.1 is inferior to that of all other algorithms.
This is because LVQ 2.1 does not have an associated cost function and it is very
sensitive to the initial position of the prototypes. In practice LVQ 2.1 needs a
pre-processing technique for finding a good initial position of the prototypes.
For the methods using an adaptive (local) metric, such as GRLVQ, LGRLVQ,
GMLVQ and LGMLVQ, the classification error is higher than that obtained by
GLVQ because these methods are similar to GLVQ in the early iterations,
and are prone to over-fitting the adaptive metric while trying to find a good
performance.

Table 3 Average Classification Errors obtained by using 10-fold Cross Validation. (The
standard deviation is shown within parentheses.)

| Classifier | Multi-modal | Image Segmentation | USPS* | USPS |
|---|---|---|---|---|
| LVQ2.1 | 0.3289 (0.0400) | 0.2886 (0.0387) | 0.2390 (0.0361) | 0.2700 (0.0954) |
| GLVQ | 0.0669 (0.0141) | 0.1205 (0.1497) | 0.0570 (0.0173) | 0.0831 (0.0036) |
| RSLVQ | 0.1583 (0.0346) | 0.2124 (0.0539) | 0.0415 (0.0076) | 0.0566 (0.0141) |
| SNG | 0.0678 (0.0141) | 0.1205 (0.0300) | 0.0570 (0.0141) | 0.0410 (0.0714) |
| SGNG | 0.0732 (0.1873) | 0.2200 (0.2121) | 0.0815 (0.2141) | 0.0922 (0.2213) |
| H2MLVQ | 0.0294 (0.0141) | 0.1743 (0.0387) | 0.0530 (0.0632) | 0.0455 (0.0917) |
| GRLVQ | 0.1183 (0.0224) | 0.0881 (0.0141) | 0.0970 (0.0245) | 0.0976 (0.1497) |
| GMLVQ | 0.1142 (0.0141) | 0.0890 (0.0300) | 0.1165 (0.0224) | 0.1321 (0.0837) |
| LGRLVQ | 0.1031 (0.0361) | 0.0531 (0.0224) | 0.1040 (0.0632) | 0.1091 (0.0200) |
| LGMLVQ | 0.0955 (0.0458) | 0.0357 (0.0224) | 0.1012 (0.0265) | 0.1074 (0.0224) |
| KRSLVQ | 0.1133 (0.0387) | 0.1587 (0.0332) | 0.0448 (0.0346) | 0.0514 (0.0224) |
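
The two initialization schemes used in the experiments (around the mean of the whole dataset for the Multi-modal dataset, and at the class means for the other datasets) can be sketched as below. This is an illustrative sketch rather than the code used for the experiments: the function names are hypothetical, the Gaussian perturbation and its `scale` are assumptions, and the sketch does not address how several prototypes placed at the same class mean are separated.

```python
import random


def class_means(X, y):
    """Return a dict mapping each class label to its mean vector."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def init_around_global_mean(X, y, n_per_class, scale=0.01, seed=0):
    """Place every prototype near the mean of the whole dataset."""
    rng = random.Random(seed)
    dim = len(X[0])
    mean = [sum(x[k] for x in X) / len(X) for k in range(dim)]
    prototypes = []
    for label in sorted(set(y)):
        for _ in range(n_per_class):
            # Assumption: small Gaussian noise around the global mean.
            w = [m + rng.gauss(0.0, scale) for m in mean]
            prototypes.append((w, label))
    return prototypes


def init_at_class_means(X, y, n_per_class):
    """Place the prototypes of each class at the mean of that class."""
    means = class_means(X, y)
    return [(list(means[c]), c) for c in sorted(means)
            for _ in range(n_per_class)]


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
y = [0, 0, 1, 1]
print(init_at_class_means(X, y, 1))
print(len(init_around_global_mean(X, y, 15)))
```

Starting all prototypes near the global mean ignores the class structure, which is why this setting exposes how strongly each algorithm depends on its initial conditions.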


4.2 Image Segmentation dataset 

In this experiment the prototypes were initialized at the mean of each class
to avoid initialization sensitivity. The number of prototypes per class was set
to $N_p = 1$. The algorithms based on distance learning (matrix or relevance
learning) reached a better performance, as shown in the third column of
Table 3, which reports the mean (std) obtained using 10-fold cross validation.
The lowest classification error was obtained by LGMLVQ with a mean value of
0.0357. This is significantly better than 6 other algorithms (KRSLVQ, RSLVQ,
LGRLVQ, SGNG, H2M-LVQ and LVQ 2.1) with a 95% confidence interval of the
mean difference according to the multi-comparison statistical test using
Bonferroni adjustment. On the other hand, LGMLVQ is not significantly different
from GLVQ, GRLVQ, SNG and GMLVQ. But, as discussed in the previous
section, the means of these 4 algorithms are significantly different from only
three algorithms, instead of the 6 found for LGMLVQ. Between GMLVQ and
GRLVQ there is no statistically significant difference, with a 95% confidence
interval of the mean difference. In the case of local distance learning
(LGMLVQ and LGRLVQ), the matrix method obtained better performance than
the relevance version, and the difference is significant with a confidence interval
of 95%. The methods based on distance learning obtained better performance
due to their capacity for modifying the shape of the receptive field of the
prototypes, locally or globally, which makes them robust on datasets with
heterogeneous features such as the Image Segmentation dataset.


4.3 USPS dataset 

For both the USPS and USPS* datasets the prototypes were initialized at the
mean of each class and the number of prototypes per class was set to $N_p = 3$.
The methods based on likelihood ratio maximization, RSLVQ and KRSLVQ,
obtained higher performance than the methods based on margin maximization.
For USPS* both RSLVQ and KRSLVQ reached average classification errors
that are significantly better than those of 6 other algorithms with a 95%
confidence interval. The best performance was obtained by RSLVQ with a mean
of 0.0415. The multi-comparison statistical test indicates that the means
obtained by RSLVQ and KRSLVQ are not significantly different with a 95%
confidence interval. The algorithms based on likelihood ratio maximization
achieved higher performance on this dataset because the best prototypes are
not necessarily located at the centroids of each cluster, which allows obtaining
better performance on datasets with strongly overlapping classes. In the case of USPS
the best performance was obtained by SNG followed by H2M-LVQ, but the
methods based on likelihood ratio maximization, RSLVQ and KRSLVQ, follow
closely in third and fourth place, as shown in Table 3 (the fifth column).
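
The rankings quoted in Sections 4.1 to 4.3 can be checked directly against the mean errors of Table 3. The sketch below simply sorts each column; the variable and function names are illustrative, and standard deviations and significance tests are deliberately left out.

```python
methods = ["LVQ2.1", "GLVQ", "RSLVQ", "SNG", "SGNG", "H2MLVQ",
           "GRLVQ", "GMLVQ", "LGRLVQ", "LGMLVQ", "KRSLVQ"]

# Mean classification errors copied from Table 3, one list per dataset.
errors = {
    "Multi-modal": [0.3289, 0.0669, 0.1583, 0.0678, 0.0732, 0.0294,
                    0.1183, 0.1142, 0.1031, 0.0955, 0.1133],
    "Image Segmentation": [0.2886, 0.1205, 0.2124, 0.1205, 0.2200, 0.1743,
                           0.0881, 0.0890, 0.0531, 0.0357, 0.1587],
    "USPS*": [0.2390, 0.0570, 0.0415, 0.0570, 0.0815, 0.0530,
              0.0970, 0.1165, 0.1040, 0.1012, 0.0448],
    "USPS": [0.2700, 0.0831, 0.0566, 0.0410, 0.0922, 0.0455,
             0.0976, 0.1321, 0.1091, 0.1074, 0.0514],
}


def best(dataset, k):
    """Return the k methods with the lowest mean error on a dataset."""
    ranked = sorted(zip(errors[dataset], methods))
    return [name for _, name in ranked[:k]]


for dataset in errors:
    print(dataset, best(dataset, 4))
```

Sorting confirms the pattern described in the text: the methods that update several prototypes lead on the Multi-modal dataset, the metric-learning methods lead on Image Segmentation, RSLVQ and KRSLVQ lead on USPS*, and SNG and H2MLVQ lead on USPS.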


5 Open Problems 

In this section some open problems and challenges in the field of LVQ classifiers
are presented.

1. Principled Approach to LVQ. Although GLVQ provides a cost function and
can be shown to be a margin maximizer method, the cost function is not
derived from first principles, such as probability theory or information theory.

2. Sparsity. Real-world datasets are becoming larger and larger, and the
computational cost of applying a prototype-based method, such as the
LVQ classifier, keeps growing. For this reason, several sparsity
approaches have recently been extended to LVQ classifiers, which allow obtaining a
linear training time without losing classification performance [28] [77].

3. Semi-supervised learning. In the real world, due to the increasing size of
datasets, it is no longer possible to label all samples. For this reason,
it is necessary to adapt the LVQ classifiers to semi-supervised learning or
an active learning framework.

4. Visualization of LVQ classifiers. Prototype-based methods such as LVQ
classifiers allow the prototypical representation of the data in the input
space. This is an advantage because when the prototypes and data can be
visualized, the classifier can be interpreted easily. But when the LVQ
classifiers work in a space different from the data space, such as in kernelized
and relational variants, the classifiers are no longer easily visualized and
interpreted. Preserving this natural capacity of the early LVQ classifiers is
of interest [24] [46].

5. Active learning. Active learning can be used to improve the generalization
ability of prototype-based methods such as LVQ classifiers. This
method gives the learner the capability of selecting samples during training.
With the active learning approach, both the generalization ability
of the model and its learning speed can be increased.


6 Conclusions 

We have presented a review of the most relevant LVQ classifiers developed in 
the last 25 years. We introduced a taxonomy of LVQ classifiers as a general 
framework. Two different main cost functions have been proposed in the liter¬ 
ature: margin maximization and likelihood ratio maximization. LVQ classifiers 
based on margin maximization have been demonstrated to have good perfor¬ 
mance in many problems. These methods put the prototypes in the centroid 
of the data which makes them less flexible with overlapping data. LVQ clas¬ 
sifiers based on a likelihood ratio maximization are an alternative that uses 
a probabilistic approach, in which the prototypes are not put in the centroid 
of the classes, which gives them flexibility in the case of overlapping data. On 
the other hand, the LVQ classifiers based on an adaptive metric have reached 
the best performance in heterogeneous feature datasets. Also, LVQ classifiers 
which update more than one prototype per iteration are less sensitive to initial 
conditions and get better performance in multi-modal datasets. 

Recently, LVQ classifiers such as kernelized or relational LVQ classifiers 
have been based on data representation for improving the performance of the 
classifiers when the data is more complex. With relational LVQ classifiers there 
is a more general representation of the data which allows working with non- 
Euclidean spaces. In this sense, the recent approaches based on dis-/similarities 
capture the inherent data structure naturally which should improve the per¬ 
formance of the classifier. The experiments done in this work have shown that 
there is no free lunch; each method has its own pros and cons. The different 
LVQ methods were designed for dealing with specific problems and datasets. 
From a more general point of view, LVQ classifiers have been demonstrated to
be very competitive classifiers, and further research is needed to achieve the
greatest success in pattern recognition tasks.